Abstract

Neural network algorithms simulated on standard computing platforms typically
make use of high resolution weights, with floating-point notation. However,
for dedicated hardware implementations of such algorithms, fixed-point synaptic 
weights with low resolution are preferable. The basic approach of reducing the 
resolution of the weights in these algorithms by standard rounding methods incurs 
drastic losses in performance. To reduce the resolution further, in the extreme case 
even to binary weights, more advanced techniques are necessary. To this end, we 
propose two methods for mapping neural network algorithms with high resolution 
weights to corresponding algorithms that work with low resolution weights and 
demonstrate that their performance is substantially better than standard rounding. 
We further use these methods to investigate the performance of three common neural
network algorithms under fixed memory size of the weight matrix with different
weight resolutions. We show that dedicated hardware systems, whose technology 
dictates very low weight resolutions (be they electronic or biological) could in 
principle implement the algorithms we study. 


1 Introduction 

1.1 Context 

Mapping floating point algorithms to fixed point hardware is a non-trivial process. The
choice of mapping method can have a major impact on the performance of the fixed
point system. Standard neural network algorithms typically operate on floating point
parameters when simulated on conventional hardware. Special purpose hardware
(such as FPGAs and neuromorphic chips), on the other hand, commonly implements
synapses with fixed point resolution and possibly a small number of bits per synaptic
weight. How this kind of hardware can best implement neural network
algorithms is an open question.

The highly related question of how biological neural networks function under limited
synaptic resolution has attracted significant attention in the neuroscience community.
It has been argued that limited synaptic resolution has profound effects on the
learning capacity of networks that rely on such synapses. This calls into question
whether it is plausible to think of the algorithms performed by biological neurons as
equivalent to artificial neural nets (ANN). This analogy particularly applies to deep
ANNs simulated with high resolution synaptic weights, which have been shown to be
highly predictive of neural responses in visual cortex.

We propose methods that allow artificial neural network algorithms to work with 
very low resolution synaptic weights using techniques from integer programming and 
image compression. 

1.2 Related Work 

In the computational neuroscience domain, a method for using low resolution synapses
is presented in [14], in which a spiking neural network is trained using an STDP learning
rule. However, [14] is only applicable to one specific learning rule and algorithm,
a version of expectation-maximisation. In contrast, we propose methods that work for
several common neural network algorithms, among them both discriminative and generative
models.

In the integer programming domain, a method called randomized rounding (RR)
has been shown to be effective in online gradient descent on the convex problem of
logistic regression; in this case an upper bound on the cost introduced by RR can be
given [16]. We apply the same method and other methods to neural network algorithms
and also address the problem of the resolution of rounding probabilities.

A very recent paper [17] examines the impact of low resolution synapses in deep
learning architectures. [17] focuses on different representations of low precision numbers
(fixed point and floating point with different allocations of bits) with standard
rounding, rather than algorithmic methods that intrinsically require lower resolution,
as we do. These two approaches may well be complementary and yield best results
when combined.


1.3 This Paper 

In this paper we are interested in mapping standard neural network algorithms that
use essentially continuous parameters onto equivalent ones that use low-resolution
parameters. The practice of transitioning between a discrete problem and its continuous
analogue is well known in integer programming as integer relaxation.

To transition from the relaxed problem (a standard neural network algorithm) to a
low synaptic weight resolution version thereof, we investigate the use of two methods:
one based on randomized rounding, and the other on a variation of an image compression
technique based on k-means.

In particular we apply these methods to reduce the resolution of the weights in neural
networks down to 2-bit resolution, while still maintaining acceptable performance
figures. We show that these methods work substantially better than the naive approach
based on normal rounding. This result is relevant for fixed point hardware implementations
of neural network algorithms and resolves a problem in the applicability of ANNs
as models of biological neural networks.


2 Mapping Methods 

The two methods proposed for mapping continuous weight algorithms to low resolution 
ones differ in the way they update the synaptic weights of the neural networks: the 
first method is based on randomized rounding and works “online” in the sense that it 
changes the update procedure of the gradient descent at each update step; the second 
method is based on k-means and is an “offline” method, as it compresses a learned 
weight matrix after training. 

The benchmark that these algorithms are tested against is based on the most straightforward
technique of resolution reduction: normal rounding. For this benchmark we
implemented a variant of gradient descent where at each time step the weight updates 
are rounded to fall onto values that are resolvable at the desired resolution. We refer to 
this method as online rounding. 
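The benchmark can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `round_to_grid` and `online_rounding_update`, the learning rate and the grid spacing are all assumptions made for the example. It also shows one plausible reason why normal rounding degrades learning: a step smaller than half the grid spacing is rounded away completely.

```python
def round_to_grid(x, eps):
    """Standard rounding of x to the nearest multiple of eps.

    Python's round() resolves exact ties to the even integer.
    """
    return eps * round(x / eps)


def online_rounding_update(theta, grad, lr, eps):
    """One gradient step whose result is rounded to the nearest grid point."""
    return round_to_grid(theta - lr * grad, eps)


# Hypothetical values: the step lr * grad = 0.01 is far below eps / 2 = 0.125,
# so every update is rounded back to the starting point.
theta = 0.5
for _ in range(100):
    theta = online_rounding_update(theta, grad=1.0, lr=0.01, eps=0.25)
```

In this toy setting the weight never leaves its initial grid point, however many updates are applied.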

2.1 Randomized Rounding 

The first method we propose is used online, during training. It makes use of the randomized
rounding function: a function that maps a point in a continuous one-dimensional
space to a point on a discrete subspace. Specifically, it maps the point probabilistically
to either the nearest or the second nearest point in the discrete subspace, each with a
probability that decreases linearly with its distance to the original point, so that
the rounded value equals the original value in expectation.


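Algorithm 1 Randomized Rounding

1: procedure RR($a$, $\epsilon$)    ▷ $a$ mapped to $\epsilon$-grid
2:     $s \leftarrow \mathrm{sign}(a)$
3:     $p \leftarrow |a|/\epsilon - \lfloor |a|/\epsilon \rfloor$    ▷ probability to increase abs. val.
4:     if $p > \mathrm{random}(0, 1)$ then
5:         $a \leftarrow s \cdot \epsilon \lceil |a|/\epsilon \rceil$    ▷ higher abs. val. grid point
6:     else
7:         $a \leftarrow s \cdot \epsilon \lfloor |a|/\epsilon \rfloor$    ▷ lower abs. val. grid point
8:     end if
9:     return $a$
10: end procedure

In the above $\lfloor \cdot \rfloor$ denotes the floor and $\lceil \cdot \rceil$ the ceiling function.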
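A direct transcription of Algorithm 1 into Python might look like the sketch below. The function name `randomized_round`, the optional `rng` argument and the example values are assumptions made for illustration; only the rounding rule itself is taken from the algorithm.

```python
import math
import random


def randomized_round(a, eps, rng=random):
    """Round a to a multiple of eps, up or down in absolute value at random."""
    s = math.copysign(1.0, a)           # sign of a
    q = abs(a) / eps                    # position in units of the grid spacing
    lower = math.floor(q)
    p = q - lower                       # probability to increase the absolute value
    if p > rng.random():
        return s * eps * math.ceil(q)   # higher abs. val. grid point
    return s * eps * lower              # lower abs. val. grid point


# Hypothetical check: 0.3 lies between the grid points 0.25 and 0.5.
rng = random.Random(0)
samples = [randomized_round(0.3, 0.25, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)      # close to 0.3, since the rounding is unbiased
```

Each sample is one of the two neighbouring grid points, while the sample mean stays close to the unrounded value. This unbiasedness is what distinguishes the method from normal rounding.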

This randomized rounding method is applied during the gradient descent update
that is part of all algorithms we study in this paper. The update step then looks as
follows.

Algorithm 2 RR Gradient Descent

1: procedure UPDATE($\theta$, $d\theta$, $\eta$, $\epsilon$)    ▷ randomized rounding gradient descent, $\theta$: parameter, $d\theta$: gradient, $\eta$: learning rate, $\epsilon$: grid spacing
2:     $\theta \leftarrow \mathrm{RR}(\theta - \eta \cdot d\theta, \epsilon)$
3:     $\theta \leftarrow \mathrm{clip}(\theta, -1, 1)$    ▷ clip $\theta$ to allowed range
4:     return $\theta$
5: end procedure

We apply randomized rounding whenever a synaptic weight gets updated: instead
of being updated to a 32-bit floating point value, it gets updated to a grid point
$x_d \in [-1, 1]$ with spacing $\epsilon$ (i.e. $x_d \in \{n \cdot \epsilon \mid n \in \mathbb{Z}\} \cap [-1, 1]$), where $\epsilon$ is chosen so that
$2^i - 1$ grid points are available in total. We call this the online stochastic method with
$i$ bits in the following plots.
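The update step and the choice of grid spacing can be sketched together as below. The formula in `grid_spacing` is an assumption: it follows from placing $2^i - 1$ equally spaced points in $[-1, 1]$ with both endpoints and zero included, which the text does not state explicitly. All function names and the toy gradient are likewise illustrative.

```python
import math
import random


def grid_spacing(bits):
    """Spacing that puts 2**bits - 1 points in [-1, 1], endpoints included (assumed)."""
    return 1.0 / (2 ** (bits - 1) - 1)


def randomized_round(a, eps, rng):
    """Round a to a multiple of eps, choosing the neighbour at random."""
    q = abs(a) / eps
    lower = math.floor(q)
    n = lower + 1 if (q - lower) > rng.random() else lower
    return math.copysign(eps * n, a)


def rr_update(theta, grad, lr, eps, rng):
    """Gradient step, randomized rounding, then clipping to [-1, 1]."""
    theta = randomized_round(theta - lr * grad, eps, rng)
    return max(-1.0, min(1.0, theta))


rng = random.Random(1)
eps = grid_spacing(2)          # 2 bits: grid {-1, 0, 1}
theta, trace = 0.0, []
for _ in range(50):
    # A constant toy gradient; the step of 0.1 is much smaller than eps = 1.
    theta = rr_update(theta, grad=-1.0, lr=0.1, eps=eps, rng=rng)
    trace.append(theta)
```

Unlike normal rounding, a step far below the grid spacing still moves the weight with a small probability per update, so on average the weight follows the gradient.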

Since in a hardware implementation the resolution of the probability in the RR 
procedure might be critical, we also ran this method with limited resolutions in p (the 
resolution of p was set equal to the resolution of the weights). The resolution of p was 
reduced by standard rounding. We refer to this as the coarse p method in the following. 
Notably this method does not rely on any high-resolution result. 
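One way to picture the coarse p variant is the sketch below. How exactly the probability was quantized is not spelled out in the text, so the grid used in `coarse_probability` (equally spaced levels between 0 and 1, one per code word of the given bit width) is an assumption, as are all names.

```python
import math
import random


def coarse_probability(p, bits):
    """Round p in [0, 1] to the nearest of 2**bits equally spaced levels (assumed grid)."""
    steps = 2 ** bits - 1
    return round(p * steps) / steps


def randomized_round_coarse_p(a, eps, bits, rng):
    """Randomized rounding whose probability is itself stored at low resolution."""
    q = abs(a) / eps
    lower = math.floor(q)
    p = coarse_probability(q - lower, bits)   # standard rounding applied to p
    n = lower + 1 if p > rng.random() else lower
    return math.copysign(eps * n, a)
```

With few bits, small probabilities collapse to zero, so very small steps are lost again, while steps that are a sizeable fraction of the grid spacing are still rounded stochastically.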

2.2 K-Means 

In this method we first train the neural network with high-resolution parameters, and
then use a technique taken from image compression (based on the k-means algorithm)
to extract k mean weight intensities. After clustering, the value of each pixel (here,
each weight) is set to the value of the center of the cluster it belongs to. In this
offline method the full weight resolution is needed during training. In principle the
clustering procedure could also be applied at every step of gradient descent, which
would yield an online method in some sense, but compared to RR, k-means is
computationally very expensive and needs ‘non-local’ information.

This method requires additional storage for the cluster center values, so that the
memory requirement is increased by $k \cdot \log_2(p)$, where $p$ is the precision of the center
value. Note that this does not scale with the matrix size $n^2$ and is negligible for $n^2 \gg k$.
Since this is the regime we are interested in, we will neglect this term in the following.
We will refer to this method as offline k-means.
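The compression step can be illustrated with a small pure-Python version of Lloyd's k-means on scalar weights. This is a sketch under assumptions: the choice of $k = 2^{bits}$ clusters, the random initialisation, the fixed iteration count and the synthetic Gaussian weights are all illustrative and not taken from the paper.

```python
import random


def kmeans_1d(values, k, iters=50, seed=0):
    """Lloyd's algorithm on scalars; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(sorted(set(values)), k)   # needs at least k distinct values
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: index of the nearest centre for every weight.
        labels = [min(range(k), key=lambda j: (v - centres[j]) ** 2) for v in values]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centres[j] = sum(members) / len(members)
    return centres, labels


def compress(weights, bits):
    """Replace each weight by its cluster centre, using k = 2**bits clusters (assumed)."""
    centres, labels = kmeans_1d(weights, 2 ** bits)
    return [centres[lab] for lab in labels], centres


# Synthetic stand-in for a trained weight matrix, flattened to a list.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.3) for _ in range(200)]
compressed, centres = compress(weights, bits=2)
```

After compression only the per-weight cluster index and the short list of centres need to be stored. Note that every assignment step looks at all centres and every update step at all weights, which is the non-local information mentioned above.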


3 Results and Discussion 

We applied the aforementioned mapping methods to three types of neural networks:
multi-layer perceptron (MLP), restricted Boltzmann machine (RBM) and neural
autoregressive distribution estimator (NADE). For all of these we investigated the
impact of varying the parameter resolution under constant hidden layer size and, for
the MLP and NADE, under constant weight matrix memory (scaling the resolution by
a factor of $a$ also scales the size of the hidden layer by $1/a$).
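The constant-memory trade-off is simple arithmetic, sketched below. The budget is hypothetical (chosen so that 500 hidden units fit at 8 bits per weight on 784 inputs, the size of an MNIST image), and the helper only counts a single input-to-hidden weight matrix, which is an assumption about what is held fixed.

```python
def hidden_units_for_budget(budget_bits, n_inputs, bits_per_weight):
    """Largest hidden layer whose n_inputs x n_hidden weight matrix fits the budget."""
    return budget_bits // (n_inputs * bits_per_weight)


budget = 784 * 500 * 8   # hypothetical memory budget in bits
sizes = {bits: hidden_units_for_budget(budget, 784, bits) for bits in (2, 4, 8, 16)}
# sizes == {2: 2000, 4: 1000, 8: 500, 16: 250}
```

Doubling the resolution halves the number of hidden units that fit, so the experiments under fixed memory compare few precise weights against many coarse ones.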

The minimal resolution we consider is a 2-bit one, because these algorithms need
at least three different values: a positive one, a negative one and zero. In neuromorphic
hardware this can translate to two species of synapses (excitatory and inhibitory) with
binary weights.
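The correspondence between a three-valued weight and a pair of binary synapses can be written out explicitly. The encoding below is one possible convention chosen for illustration; the function names are hypothetical.

```python
def split_ternary(w):
    """Map a weight in {-1, 0, 1} to (excitatory, inhibitory) binary synapses."""
    if w not in (-1, 0, 1):
        raise ValueError("expected a ternary weight")
    return (1 if w > 0 else 0, 1 if w < 0 else 0)


def merge(excitatory, inhibitory):
    """Effective weight of an excitatory and an inhibitory binary synapse."""
    return excitatory - inhibitory


pairs = {w: split_ternary(w) for w in (-1, 0, 1)}
```

Under this convention the pair (1, 1) is never produced, which matches the count of $2^2 - 1 = 3$ usable values at 2-bit resolution.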

In the case of the RBM, it is difficult to give a scalar measure of performance,
because the log-likelihood of some given data under a known RBM model is computationally
intractable (unless the RBM is very small). To obtain a scalar measure for the
performance of a generative model we also applied our methods to the NADE, an
RBM-inspired distribution learner of similar power, for which the log-likelihood assigned to
some given data is tractable. To assess the performance of the RBM, samples and
connection weights produced in the different conditions are plotted.

The MLP and RBM were trained on a binarized version of the MNIST hand-written
digits dataset [20] in a Theano-based [21] GPU implementation of batch gradient descent.
The performance measure for the MLP is the percentage of the test-set samples
that were misclassified.

The NADE was trained on the “dna” dataset from the libsvm webpage [22] using
the code provided in the supplementary materials of the NADE reference, modified to
allow our rounding methods. The performance measure for the NADE is the negative
log-likelihood of the test set.



[Figure 1 plot; x-axis: n-bits per weight (fixed point)]

Figure 1: The performance on MNIST of the fully trained MLP as a function of the
number of bits per weight. Standard gradient descent with 32-bit floating point weights
reached an error of 1.81%. There are 10 categories and the chance level is at 90%.

Figures 1 and 2 show how the performance changes as we increase the weight resolution
under fixed hidden layer size (500 units). We observe that even 2-bit weights
can perform far above chance level. Performance improves monotonically with
higher resolution, while the gain per added resolution bit decreases and ends in a
plateau whose floor lies near the performance of standard gradient descent. The
location of the plateau floor indicates a slightly poorer performance
of the low resolution algorithm. This is expected: the low resolution
algorithm cannot resolve continuous parameter values, so that in the end phase of the
descent it will randomly jump around the minimum rather than reaching it (in the limit
of infinite resolution the algorithms converge back to the high resolution algorithm and
the plateau eventually reaches that level of performance).
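The jumping behaviour near the minimum can be reproduced on a one-dimensional toy problem. The quadratic loss, the target value, the grid spacing and the learning rate below are all hypothetical and serve only to illustrate the argument.

```python
import math
import random


def randomized_round(a, eps, rng):
    """Round a to a multiple of eps, choosing the neighbour at random."""
    q = abs(a) / eps
    lower = math.floor(q)
    n = lower + 1 if (q - lower) > rng.random() else lower
    return math.copysign(eps * n, a)


rng = random.Random(0)
target, eps, lr = 0.3, 0.25, 0.1     # minimum lies between grid points 0.25 and 0.5
theta, trace = 0.0, []
for _ in range(6000):
    grad = 2.0 * (theta - target)    # gradient of (theta - target)**2
    theta = randomized_round(theta - lr * grad, eps, rng)
    trace.append(theta)

tail = trace[1000:]
visited = sorted(set(tail))          # grid points visited late in the run
average = sum(tail) / len(tail)      # time average of the weight
```

The weight can only ever sit on a grid point, so it never equals the minimum at 0.3. Instead it keeps hopping between the neighbouring grid points, and in this toy case its time average tends to stay close to the minimum.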

For the MLP, the ‘coarse p’ method at 3-bit resolution performs similarly well as the 6-bit
normal rounding method, which uses equally much memory per weight update. In contrast,
for the NADE the normal rounding method at 10-bit resolution performed at chance
level, while the 5-bit ‘coarse p’ method performed above chance.

Figure 2: The performance on the ‘dna’ set of the fully trained NADE as a function
of the number of bits per weight. Standard gradient descent with 32-bit floating point
weights reached 84.6.



[Figure 3 plot; x-axis: epoch]

Figure 3: The cross-validation and training error are very close for a 2-bit resolution
MLP on MNIST. This indicates a high-bias model.

The learning curves for a low resolution MLP (500 hidden units, 2-bit resolution)
in Figure 3 show that for very low resolutions the model performs very similarly on
the training set and on the cross-validation set. This indicates that this model is limited by
its expressive power, rather than by the learning algorithm (it has ‘high bias’ rather
than ‘high variance’). In light of this, randomized rounding can also be interpreted as a
regularization procedure.

Figures 4 and 5 show how the performances of NADE and MLP change as we increase
the weight resolution while keeping the memory size of the weight matrix fixed
at 400 bits. Under these conditions it is clearly preferable to choose an intermediate
resolution.

[Figure 4 plot; x-axis: n-bits per weight (fixed point)]

Figure 4: The performance on the ‘dna’ set of the fully trained NADE as a function of
the number of bits per weight while keeping the matrix memory size constant. The
minimum lies at 6.

Figure 5: The performance on MNIST of the fully trained MLP as a function of
the number of bits per weight while keeping the matrix memory size constant. The
minimum lies at 8.

Figure 6 shows activation probabilities for samples given by RBMs with different
weight resolutions (all have hidden layer size 500) trained with PCD-15 [23]. As
with the other algorithms the quality improves with higher resolutions, but even 2-bit
weights already result in clearly recognisable digits (albeit noisy ones) for the randomized
rounding method.

Figure 7 shows receptive fields learned in RBMs with varying weight resolutions.
Notably there are some hidden units whose receptive fields ‘look’ very noisy for low
resolution weights. However, it may well be the case that it is difficult to judge by eye
what constitutes a ‘useful’ receptive field; conversely the weights for the 2-bit k-means
method ‘look’ useful but do not produce good samples.

A particularly interesting application of randomized rounding gradient descent
would be a neuromorphic neural network implementation with memristive synapses
that exhibit probabilistic switching [24]. For other algorithms it has already been proposed
that this behaviour could be exploited in neuromorphic hardware [14]. Thus it
could be possible to implement the randomized rounding step directly in the memory
unit, without need for a random number generator.

Figure 6: Samples (activations, not binarized) for 2, 4, 6, and 8-bit RBM trained with
the four different resolution reduction methods. Inside each picture: four different
initial conditions (random test samples) arranged horizontally, over 3000 passes, printed
every 1000 steps and arranged vertically.

Figure 7: Final receptive fields for 2, 4, 6, and 8-bit RBM


4 Conclusion 

We presented two methods to reduce the resolution of common neural network algorithms
to very low resolutions, while maintaining comparatively good performance:
randomized rounding, an ‘online’ method for performing gradient descent with low
resolution parameters, and k-means rounding, a post-processing method that reduces
the resolution of the weights in neural networks trained with normal high-resolution
parameters. We applied these methods to the MLP, NADE and RBM neural network
algorithms, showing a graceful degradation of performance with decreasing weight
resolution.

Using these techniques, the performance of the algorithms plateaued around the
10-bit resolution mark for the datasets and parameter ranges we studied; no substantial
improvement was made above this resolution and additional memory was better invested
in larger hidden layers. The offline method based on k-means produced better results
than the online method, and both performed substantially better than rounding.

Overall we find that there are no fundamental problems with the use of even binary 
excitatory and inhibitory synapses (i.e. 2-bit weights) in the tested ANN algorithms. 
Such low resolution synapses are common in neuromorphic hardware, and 2-bit weights
are the lower bound for the resolution of biological synapses.

Increasing to 6 or 8-bit resolution yielded substantial performance improvements
with RR gradient descent. At very low resolutions it seems sensible to forgo learning
on a dedicated hardware implementation if it is not inherently required; a system
of the same memory size can then deliver better performance using an offline compression
of the weight matrix.

Notably randomized rounding worked well as a mapping method to fixed point 
weights for all algorithms we tested and down to very low resolutions. We speculate 
that other gradient-descent-based algorithms may well be similarly compatible with 
randomized rounding. 